This report is ETC5521 Assignment 1 by Team Hakea, comprising Dang Thanh Nguyen and Rui Min Lin.

1 Introduction and motivation

Coffee is one of the most popular beverages in the world. As an everyday drink, it attracts people not only with its taste but also with a lasting refreshing effect that helps them stay focused. Let’s see, you’re reading this with a cup of coffee, aren’t you?

So wouldn’t it be interesting to know how the coffee you drink is graded, and what separates the best batches from the poor ones? Which country produces the best-graded coffee beans, and why are they considered the best? Which factors explain the quality of the coffee cultivated: where the beans come from, and the processing method used to produce them?

2 Data description

Source of data: The data originally come from the Coffee Quality Institute website and were scraped by James LeDoux, who published them on GitHub. The data were later re-posted on Kaggle, and the dataset was analysed by Yorgos Askalidis.

Collection methods: This dataset contains reviews of 1312 Arabica and 28 Robusta coffee beans by the Coffee Quality Institute’s trained reviewers.

Time frame of dataset: January 2018

The original data frame scraped by James LeDoux from the Coffee Quality Institute website contains several columns with missing values, so the author cleaned the dataset by removing variables such as “view_certificate_1”, “view_certificate_2”, etc.

Since the two datasets (Arabica and Robusta) simply cover two different species of coffee, they were joined to produce the dataset we use here, “coffee_ratings.csv”, with 1339 observations and 43 variables.

2.1 Data limitation

  • As the number of graded coffee beans differs widely from country to country, some of the analysis will be biased.
  • For the US, three areas produce coffee beans: the mainland, Puerto Rico and Hawaii. In this report, these areas are merged to better represent the country.
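
The merge of the three US areas can be sketched as a simple relabelling step. This is a minimal illustration, not the code used for the report: the helper name `normalise_country` is hypothetical, and the labels are taken from the country names listed in Section 3.

```python
# Labels for the two non-mainland US areas, as they appear in the
# dataset's list of countries.
US_AREA_LABELS = {
    "United States (Hawaii)",
    "United States (Puerto Rico)",
}


def normalise_country(name):
    """Map the Hawaii and Puerto Rico labels onto 'United States'.

    Hypothetical helper: the report does not show how the merge was done.
    """
    return "United States" if name in US_AREA_LABELS else name


# A few example labels, before and after merging.
raw = [
    "United States",
    "United States (Hawaii)",
    "Ethiopia",
    "United States (Puerto Rico)",
]
merged = [normalise_country(country) for country in raw]
print(merged)
```

Every other country label passes through unchanged, so the step can be applied to the whole country column without special cases.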

Structure of data:

It is important to know what each variable means in the context of coffee grading, so here are a few variables whose general meaning we should know:

  • Aroma (aroma grade): covers both fragrance (ground beans) and aroma (coffee powder in hot water).

  • Aftertaste: length of positive flavor remaining after the coffee is swallowed.

  • Acidity: the score depends on the origin characteristics and other factors (such as degree of roast).

  • Uniformity: refers to the consistency of flavor. 2 points are awarded for each cup displaying this attribute, with a maximum of 10 points if all 5 cups are the same.

  • Clean_cup: refers to a lack of interfering negative impressions from first ingestion to final aftertaste, a “transparency” of cup. 2 points are awarded for each cup displaying the attribute.

  • Cupper-points: The cupper marks the intensity of the aroma on a scale. A final score for fragrance and aroma is then given on the basis of a combined evaluation of the two.

  • Category 1 defect: Full black or sour bean, pod/cherry, and large or medium sticks or stones.

  • Category 2 defect: Parchment, hull/husk, broken/chipped, insect damage, partial black or sour, shell, small sticks or stones, water damage.

  • Quakers: Unripened beans that are hard to identify during hand sorting and green bean inspection.

Questions of interest

The aim of this report is to discover the characteristics of the countries with the best-graded coffee beans, and to examine different aspects of coffee beans to explore the factors likely to influence their quality and taste.

Secondary questions:

  1. Which country produces the best quality coffee beans?

  2. Does altitude really affect the quality of the beans produced?

  3. Which countries perform best on individual grading criteria such as aroma, acidity and sweetness?

  4. Which regions/companies perform better than others in the quality of the coffee beans produced, intra-country?

  5. When comparing the characteristics of Arabica and Robusta, what insights can be discovered?

  6. Are there any similarities in the processing method of coffee beans amongst the countries with the best-graded coffee beans?

3 Exploratory Data Analysis

Coffee beans are grown, processed and exported by many countries around the world. This dataset contains data for the following countries of origin: Ethiopia; Guatemala; Brazil; Peru; United States; United States (Hawaii); Indonesia; China; Costa Rica; Mexico; Uganda; Honduras; Taiwan; Nicaragua; Tanzania, United Republic Of; Kenya; Thailand; Colombia; Panama; Papua New Guinea; El Salvador; Japan; Ecuador; United States (Puerto Rico); Haiti; Burundi; Vietnam; Philippines; Rwanda; Malawi; Laos; Zambia; Myanmar; Mauritius; Côte d’Ivoire; India; as well as a missing value (NA).

In this report we focus on the production and quality aspects of the beans. The two main varieties of coffee bean are Arabica and Robusta. Approximately 60% of the coffee produced in the world is Arabica and approximately 40% is Robusta. Arabica beans contain about 0.8%-1.4% caffeine and Robusta beans about 1.7%-4%. Coffee is one of the most important cash crops in the world (source: Wikipedia).

3.1 Best Quality Beans

The Coffee Quality Institute is a non-profit organization that grades coffee samples from around the world in a consistent and professional manner.

The coffee beans are graded by the Coffee Quality Institute’s trained reviewers. The total rating of a coffee bean is the sum of 10 individual quality measures: aroma, flavour, aftertaste, acidity, body, balance, uniformity, clean cup, sweetness and cupper points. Each measure is graded on a 0–10 scale, resulting in a total cupping score between zero and one hundred. Figure 3.1 addresses the primary question: Which country produces the best quality coffee beans? The x-axis shows the country and the y-axis the overall rating achieved by the coffee beans. Ethiopia produced the highest-rated coffee beans. However, there is not much variation between countries, as most of them have a median score of around 80-85 points. We can therefore conclude that, based on this dataset, coffee quality differs little between countries, with Ethiopia producing the highest-quality beans.

Figure 3.1: Boxplot for total ratings of coffee beans by country
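
The way the total rating is built from the ten measures can be sketched as below. This is an illustration only: the dictionary keys are assumed names for the ten measures, the helper `total_cup_points` is hypothetical, and the example scores are made up rather than taken from the dataset.

```python
# The ten quality measures that make up the total rating (assumed key names).
CRITERIA = [
    "aroma", "flavor", "aftertaste", "acidity", "body",
    "balance", "uniformity", "clean_cup", "sweetness", "cupper_points",
]


def total_cup_points(scores):
    """Sum the ten 0-10 grades into a total score between 0 and 100."""
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    for criterion in CRITERIA:
        if not 0 <= scores[criterion] <= 10:
            raise ValueError(f"{criterion} must lie between 0 and 10")
    return sum(scores[c] for c in CRITERIA)


# Made-up grades for a single coffee sample (not from the dataset).
example_scores = {
    "aroma": 7.5, "flavor": 7.5, "aftertaste": 7.25, "acidity": 7.5,
    "body": 7.5, "balance": 7.5, "uniformity": 10, "clean_cup": 10,
    "sweetness": 10, "cupper_points": 7.5,
}
example_total = total_cup_points(example_scores)
print(example_total)
```

Because uniformity, clean cup and sweetness are often at or near 10, most of the spread in the total comes from the remaining seven measures.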

3.2 Altitude v/s Quality

Next, we probe the question Does altitude really affect the quality of the beans produced? To explore this we fitted a linear model with total_cup_points as the dependent variable and altitude_mean_meters as the independent variable. We considered only the top 1105 observations when constructing the model, because 3 extreme values and the missing values otherwise rendered the model statistically insignificant, with a p-value of more than 0.9. The fitted model returns a p-value of 0.0003 for altitude, which suggests that the relationship is statistically significant. Figure 3.2 shows a positive relationship between altitude and the quality of coffee beans produced: the estimated slope of 0.0008 corresponds to roughly 0.8 additional points per 1000 metres of altitude. Altitude plays an important role in the formation of acidity and bitterness and enhances coffee quality attributes.

Table 3.1: Altitude v/s Quality stats
term                  estimate    std.error  statistic   p.value
(Intercept)           81.0500388  0.3139834  258.134802  0.0000000
altitude_mean_meters  0.0008003   0.0002221  3.604186    0.0003271

Figure 3.2: Regression model for Altitude
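
The quantities reported in Table 3.1 (estimate, standard error and t statistic) can be sketched for a one-predictor model as follows. This is a minimal least-squares illustration using only the standard library, not the code that produced the table. The function `fit_simple_ols` is a hypothetical helper and the five altitude/score pairs are invented for the example.

```python
import math


def fit_simple_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least 3 paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual variance uses n - 2 degrees of freedom (two fitted parameters).
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    std_error = math.sqrt(sse / (n - 2) / sxx)
    return {
        "intercept": intercept,
        "slope": slope,
        "std_error": std_error,
        # Assumes the fit is not perfect, otherwise std_error would be zero.
        "t_statistic": slope / std_error,
    }


# Invented data: mean altitude in metres and total cup points.
altitudes = [1000, 1200, 1400, 1600, 1800]
scores = [82, 84, 85, 84, 85]
fit = fit_simple_ols(altitudes, scores)
print(fit)
```

The t statistic is the estimate divided by its standard error, which is also how the statistic column of Table 3.1 relates to the two columns before it. The p-value then comes from comparing that statistic with a t distribution on n - 2 degrees of freedom, a step left out here because the standard library has no t distribution.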

3.3 Processing Method v/s Quality

To check whether the processing method affects the quality of coffee beans produced, we used an ANOVA test, as the processing method is a categorical variable. The test returns a p-value of 0.0956, which is above the 5% significance level, so we cannot reject the null hypothesis that mean quality is the same across processing methods. Figure 3.3 plots the residuals against the fitted values. Hence we find no significant evidence that the processing method used in producing the coffee beans influences the quality of the beans produced.

##                     Df Sum Sq Mean Sq F value Pr(>F)  
## processing_method    4     58  14.468   1.978 0.0956 .
## Residuals         1164   8514   7.314                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 170 observations deleted due to missingness

Figure 3.3: ANOVA test for processing methods vs Quality
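
The Df, Sum Sq, Mean Sq and F value columns of the ANOVA output above follow a fixed recipe, sketched below with the standard library. This is not the code used for the report: `one_way_anova` is a hypothetical helper, the group labels are placeholders, and the scores are invented.

```python
def one_way_anova(groups):
    """Return the one-way ANOVA table entries for a list of groups."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    if k < 2 or n <= k:
        raise ValueError("need at least 2 groups and more values than groups")
    grand_mean = sum(sum(g) for g in groups) / n
    ss_between = 0.0
    ss_within = 0.0
    for g in groups:
        group_mean = sum(g) / len(g)
        # Between-group variation: group mean against the grand mean.
        ss_between += len(g) * (group_mean - grand_mean) ** 2
        # Within-group variation: each value against its own group mean.
        ss_within += sum((v - group_mean) ** 2 for v in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return {
        "df_between": k - 1,
        "df_within": n - k,
        "ss_between": ss_between,
        "ss_within": ss_within,
        "f_value": ms_between / ms_within,
    }


# Invented scores for three placeholder processing methods.
toy_scores = {
    "method_a": [82, 83, 84],
    "method_b": [83, 84, 85],
    "method_c": [86, 87, 88],
}
anova = one_way_anova(list(toy_scores.values()))
print(anova)

# The same ratio applied to the Mean Sq column reported above.
report_f = round(14.468 / 7.314, 3)
print(report_f)
```

The F value is the mean square between groups divided by the mean square within groups, so the reported 1.978 is just 14.468 divided by 7.314.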

3.4 Defects v/s Quality

After processing method turned out to have no significant effect, we looked at the other variables in the dataset with which we could fit a model. The dataset contains category one and category two defects, also known as primary and secondary defects, and we fitted a multiple regression model using both. Both coefficients are negative with p-values below 0.01 (Table 3.2), and as can be seen in Figure 3, almost all the residuals lie very close to the 0 line, with very few outliers. Thus defects appear to influence the quality of coffee beans produced: each additional category one defect is associated with about 0.11 fewer points, and each additional category two defect with about 0.13 fewer points.

Table 3.2: Defects linear regression model
term                  estimate    std.error  statistic   p.value
(Intercept)           82.5885115  0.1122747  735.593003  0.00000
category_one_defects  -0.1106967  0.0378995  -2.920796   0.00355
category_two_defects  -0.1252918  0.0181894  -6.888192   0.00000
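
To make the coefficients in Table 3.2 concrete, the sketch below plugs them into the fitted equation. The three constants are copied from the table. The function name `predicted_total_points` and the defect counts in the example are hypothetical.

```python
# Estimates copied from Table 3.2.
INTERCEPT = 82.5885115
COEF_CATEGORY_ONE = -0.1106967
COEF_CATEGORY_TWO = -0.1252918


def predicted_total_points(category_one_defects, category_two_defects):
    """Predicted total cup points under the fitted defects model."""
    return (
        INTERCEPT
        + COEF_CATEGORY_ONE * category_one_defects
        + COEF_CATEGORY_TWO * category_two_defects
    )


# A defect-free sample versus a hypothetical sample with 2 category one
# defects and 5 category two defects.
clean_sample = predicted_total_points(0, 0)
defective_sample = predicted_total_points(2, 5)
print(clean_sample, defective_sample)
```

Under this model the hypothetical sample is predicted to score a little under one point below a defect-free sample. The effect is statistically clear but small next to the 80-85 point range most samples fall in.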

3.5 Individual Criteria

To check which criteria the top 5 countries perform best in, we used radar charts from Figure 4 onwards. A radar chart is a useful way to depict multivariate observations. Each criterion is rated out of a total of 10 points, and all 10 criteria are plotted together on the radar chart, along with moisture percentage, to understand how a particular country performs on individual criteria. The top 5 coffee-bean-producing countries according to our analysis are Ethiopia, Brazil, United States, Indonesia and Peru.

Table 3.3: Mean scores for individual grading criteria
country_of_origin  aroma     flavor    aftertaste  acidity   body      balance   uniformity  clean_cup  sweetness  cupper_points  moisture
Brazil             7.606667  7.548182  7.363939    7.464545  7.532727  7.543030  9.757273    9.69697    9.939091   7.497576       0.0803030
Ethiopia           8.001429  8.154286  7.892857    8.154286  7.930000  8.012857  9.904286    10.00000   10.000000  8.141429       0.0885714
Indonesia          7.682000  7.416000  7.200000    7.214000  7.600000  7.230000  9.866000    10.00000   9.866000   7.268000       0.0700000
Peru               7.446667  7.333333  7.223333    7.386667  7.530000  7.446667  9.776667    10.00000   10.000000  7.306667       0.1100000
United States      7.790000  7.875000  7.670000    7.875000  7.790000  7.670000  9.665000    9.66500    8.710000   7.835000       0.0000000

After looking at these plots, we draw the following conclusions. The characteristic these top 5 countries have in common is consistently high values of uniformity and clean cup. Among these countries, Ethiopia has the highest (or joint-highest) mean on every grading criterion in Table 3.3. It is also interesting that sweetness is at or near the perfect score of 10 in every country except the United States, where the mean is 8.71.
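
The comparison across countries can be checked directly against the figures in Table 3.3. The sketch below is an illustration rather than the report's own code. It assumes the eleven table columns are, in order, the mean aroma, flavor, aftertaste, acidity, body, balance, uniformity, clean cup, sweetness, cupper points and moisture, and the helper `leaders` is hypothetical.

```python
# Assumed meaning of the eleven columns of Table 3.3, in table order.
CRITERIA = (
    "aroma", "flavor", "aftertaste", "acidity", "body", "balance",
    "uniformity", "clean_cup", "sweetness", "cupper_points", "moisture",
)

# Mean values copied row by row from Table 3.3.
MEANS = {
    "Brazil": (7.606667, 7.548182, 7.363939, 7.464545, 7.532727, 7.543030,
               9.757273, 9.69697, 9.939091, 7.497576, 0.0803030),
    "Ethiopia": (8.001429, 8.154286, 7.892857, 8.154286, 7.930000, 8.012857,
                 9.904286, 10.00000, 10.000000, 8.141429, 0.0885714),
    "Indonesia": (7.682000, 7.416000, 7.200000, 7.214000, 7.600000, 7.230000,
                  9.866000, 10.00000, 9.866000, 7.268000, 0.0700000),
    "Peru": (7.446667, 7.333333, 7.223333, 7.386667, 7.530000, 7.446667,
             9.776667, 10.00000, 10.000000, 7.306667, 0.1100000),
    "United States": (7.790000, 7.875000, 7.670000, 7.875000, 7.790000,
                      7.670000, 9.665000, 9.66500, 8.710000, 7.835000,
                      0.0000000),
}


def leaders(criterion):
    """Return the countries sharing the highest mean for one criterion."""
    idx = CRITERIA.index(criterion)
    best = max(row[idx] for row in MEANS.values())
    return [country for country, row in MEANS.items() if row[idx] == best]


for criterion in CRITERIA:
    print(criterion, leaders(criterion))
```

Ties are returned as a list, which matters for clean cup and sweetness, where several countries share the maximum.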

3.6 Leading Regions

To respond to our question Which regions/companies perform better than others in the quality of the coffee beans produced, intra-country? we use a bar plot in Figure 3.4. The x-axis shows the total points a coffee bean receives and the y-axis the different regions where the beans were produced. The bars are coloured according to the country they belong to. Huila, a region in Colombia, stands out: its coffee beans receive consistently good ratings, making it the best-performing region in this dataset.


Figure 3.4: Best Regions for the quality of coffee beans

3.7 Largest Producer

We also observe that Colombia is the largest producer of Arabica coffee beans in this dataset, with more than 40000 bags, followed by Guatemala and Brazil. We show this using a bar plot in Figure 3.5, with the y-axis representing the number of bags produced and the x-axis the different countries. The figure shows the top 6 coffee-bean-producing countries.


Figure 3.5: Largest producer country-wise

4 References